Abstract


This paper analyzes the performance of multilayer feed forward neural networks trained with gradient descent with momentum and adaptive back propagation (TRAINGDX) and with BFGS quasi-Newton back propagation (TRAINBFG) on handwritten Hindi characters of SWARS. In this analysis, five sets of handwritten Hindi SWARS characters written by different people are collected and stored as images. A MATLAB function is used to determine the densities of these scanned images after partitioning each image into 16 portions. These 16 densities for each character are used as the input pattern for two different neural network architectures. The two learning rules, both variants of the back propagation learning algorithm, are used to train these neural networks. The performance of the two neural networks is analyzed for convergence and for trends of error in the case of non-convergence. Some interesting and important observations are made about the trends of error in the case of non-convergence. The local minima problem inherited from the back propagation algorithm strongly affects these two learning algorithms as well.
1. Introduction


Recognition is a basic property of all human beings; when a person sees an object, he or she first gathers all information about the object and compares its properties and behavior with the existing knowledge stored in the mind. If a proper match is found, the object is recognized [1]. The concept of recognition is simple in the real world environment, but in the world of computer science, recognizing any object is an amazing feat. The functionality of the human brain is amazing; it is not comparable with any machine or software. The act of recognition can be divided into two broad categories: (a) Concrete item recognition, which involves the recognition of spatial samples such as fingerprints, weather maps, pictures and physical objects, and the recognition of temporal samples such as waveforms and signatures. (b) Abstract item recognition, which involves the recognition of a solution to a problem, an old conversation or an argument.


More generally, pattern recognition spans a number of scientific disciplines, uniting them in the search for a solution to the common problem of recognizing the pattern of a given class and assigning the name of the identified class. Pattern recognition is the categorization of input data into identifiable classes through the extraction of significant attributes of the data from irrelevant background details. A pattern class is a category determined by some common attributes; that is, a family of patterns that share some common properties. A pattern is therefore the description of a category member representing a pattern class. Pattern recognition by machine involves techniques for assigning patterns to their classes automatically with as little human intervention as possible. Pattern recognition aims to classify data (patterns) based on either a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space [3]. In the literature, a complete pattern recognition system is defined in terms of the following components: a sensor that gathers the observations to be classified or described; a feature extraction mechanism that computes numeric or symbolic information from the observations; and a classification or description scheme that does the actual job of classifying or describing observations, relying on the extracted features. A very common domain for pattern recognition in computer science is character recognition or classification. Thus our basic purpose is to investigate the available methods and techniques related to the character recognition process.


Character recognition plays an important role in today's life [4]. It can solve many complex real-life problems. An example of character recognition is the recognition of handwritten English alphabets. The classic difficulty in correctly recognizing even typed optical language symbols is the complex irregularity among pictorial representations of the same character due to variations in fonts, styles and sizes. This irregularity undoubtedly widens when one deals with handwritten characters [5]. Classification method designs are based on the following concepts:

Member-roster concept: Under this template-matching concept, a set of patterns belonging to the same pattern class is stored in a classification system. When an unknown pattern is given as input, it is compared with the existing patterns and is placed under the matching pattern class.

Common property concept: In this concept, the common properties of patterns are stored in a classification system. When an unknown pattern arrives, the system checks its extracted common properties against the common properties of the existing classes and places the pattern/object under the class which has similar common properties.

Clustering concept: Here, the patterns of the targeted classes are represented as vectors whose components are real numbers. Using their clustering properties, we can classify an unknown pattern. If the target vectors are far apart in geometrical arrangement, it is easy to classify the unknown patterns. If they are nearby, or if there is any overlap in the cluster arrangement, we need more complex algorithms to classify the unknown patterns [6-7].

Various methods have been proposed in the literature for character classification or recognition, and some of them have shown their effectiveness for this task. Initially the attempts to accomplish the task were basically confined to the statistical domain, such as Bayesian decision theory [8], but these methods have their own pros and cons.


Bayesian decision theory is a system that minimizes the classification error. It has a conceptual clarity leading to an elegant numerical recipe, and it can deal with a broader scope of stochastic models than classical algorithms. The nearest neighbor rule [9] is also used to classify handwritten characters. This rule needs a distance measure between two character images. The algorithm works well when the target patterns are far apart, and training in the nearest neighbor rule is very fast. Linear classification or discrimination [7] deals with assigning a new point in a vector space to a class separated by a boundary. It is well suited to mixed data types. It can also handle non-linear cases and missing data. The results produced by the system are very easy to interpret.

All the methods mentioned above have their limitations. Bayesian decision theory has computational difficulties: the method has difficulty filling in numerical details. It also requires prior information; if this is unavailable, the theory will not work properly. The basic problem with the nearest neighbor rule is that it requires a large set of data and its query time is very slow. Because of the large data set, it is very prone to data errors: if any irrelevant information is entered into the system, the system will easily misinterpret the results. The same problem of a large data set occurs with the linear classification method [10].


The task of character recognition can be accomplished by a human being without much effort, owing to the complex structure of the brain and its parallel mode of operation. The structure of the biological neural network [11] has been simulated and modeled, in a serial fashion, by artificial neural networks (ANN), which provide this parallelism. One of the important advantages of an ANN is its adaptive nature [12]; due to this property many existing paradigms can easily be fused into it. A powerful attribute of a neural network is its ability to learn an arbitrary non-linear mapping using an appropriate learning rule. Once the ANN has been trained, it can be used for pattern classification [13], pattern association, pattern mapping, pattern grouping [13], feature mapping, optimization, control, etc. To accomplish the task of pattern classification and pattern mapping, the supervised multilayer feed forward neural network [14, 15] is considered, with a non-linear differentiable function in all processing units of the output and hidden layers. The processing units in the input layer, whose number corresponds to the dimensionality of the input pattern, are linear. The number of output units corresponds to the number of distinct classes in the pattern classification. A method has been developed [16] so that the network can be trained to capture explicitly the mapping in the set of input-output pattern pairs collected during an experiment, and is simultaneously expected to model the unknown system or function from which predictions can be made for new or untrained data. The possible output pattern class would be approximately an interpolated version of the output pattern classes corresponding to the input learning patterns close to the given test input pattern. This method involves the back propagation learning rule [17], based on the principle of gradient descent along the error surface in the weight space. This algorithm is used for the training of a supervised multilayer feed forward neural network, so that the network can be trained to capture the missing implicit pattern and generate the classification for different features in the given set of input-output pattern pairs.


The character classification problem is related to heuristic logic, as human beings recognize characters and documents through learning and experience. Hence neural networks, which are more or less heuristic in nature, are extremely suitable for this kind of problem. Various types of neural networks are used for OCR classification. However, most of the work on character classification or recognition using neural network techniques is concentrated on English character recognition [49]. English Character Recognition (CR) has been extensively studied in the last half century and has progressed to a level sufficient to produce technology driven applications. The same is not the case for Indian languages, which are complicated in terms of structure and computations. Rapidly growing computational power may enable the implementation of Indic CR methodologies. Digital document processing is gaining popularity for application to office and library automation, bank and postal services, publishing houses and communication technology. This study investigates the direction of Devnagari Optical Character Recognition (DOCR) research, analyzing the limitations of methodologies for the system, which can be classified based upon two major criteria: the data acquisition process (on-line or off-line) and the text type (machine-printed or handwritten). No matter to which class the problem belongs, in general there are five major stages in a DOCR problem:


1. Pre-processing
2. Segmentation
3. Feature Extraction
4. Recognition
5. Post-processing


These stages are very common and apply to almost every method or technique used for character recognition, regardless of the language. Handwriting recognition technology has improved considerably under the purview of pattern recognition and image processing over the past few decades. Hence various soft computing methods involved in other types of pattern and image recognition can also be used for DOCR. Seminal and comprehensive work in DOCR has been described in [18-24]. A general review of statistical pattern recognition can be found in [25-28]. These can be taken as good starting points to reach the recent studies in various types and applications of DOCR problems. An excellent overview of document analysis can also be found in [29].


In the present paper, we explore the necessary steps discussed above for the implementation and analysis of feed forward multilayer neural networks for the handwritten Hindi characters of SWARS. In our proposed method, multilayer feed forward neural networks are trained with two learning algorithms that are variants of the generalized delta learning rule, namely the quasi-Newton back propagation learning algorithm and the gradient descent with momentum and adaptive back propagation method, on the training set of handwritten Hindi characters of SWARS. These algorithms are used to analyze the performance of the neural networks for the given training set. This paper presents an analytic study of the behavior of the neural network system. The study indicates some important observations and characteristics about the nature of the error in the case of non-convergence.


Section 2 of this paper describes the features of Hindi characters of SWARS which are required for feature extraction. Section 3 discusses the approach used for feature extraction from the input stimuli used for training. Section 4 describes the generalized delta learning rule in detail, together with the learning algorithms used. These learning algorithms are variants of the back propagation or generalized delta learning rule, and their description covers the implementation details used in this paper. Section 5 describes the architecture and design of the network used, in terms of simulation design. Section 6 shows the results of the experiments with the discussion. Section 7 presents the conclusion and the future work based on this analysis.


2. FEATURES OF HINDI SWARS CHARACTERS


India is a multi-lingual and multi-script country comprising eighteen official languages. One of the defining aspects of an Indian script is the repertoire of sounds it has to support. Because there is typically a letter for each of the phonemes in Indian languages, the alphabet set tends to be quite large. Most of the Indian languages originated from the BRAHMI script. These scripts are used for two distinct major linguistic groups, Indo-European languages in the north and Dravidian languages in the south [30]. DEVNAGARI is the most popular script in India. It has 11 vowels and 33 consonants, which are called basic characters. Vowels can be written as independent letters, or by using a variety of diacritical marks which are written above, below, before or after the consonant they belong to. When vowels are written in this way they are known as modifiers and the characters so formed are called conjuncts. Sometimes two or more consonants combine and take new shapes. These new shape clusters are known as compound characters. Basic characters, compound characters and modifiers of this kind are present not only in DEVNAGARI but also in other scripts. Hindi, the national language of India, is written in the DEVNAGARI script and is the third most popular language in the world [28]. It consists of SWARS and VYANJANS: there are 13 SWARS and 34 VYANJANS. In this paper we are concerned with the recognition of the SWARS characters of the Hindi language, so our problem domain is restricted to the description of SWARS. A sample of the Hindi SWARS character set is provided in table 1.


SWARS: अ आ इ ई उ ऊ ऋ ए ऐ ओ औ अं अः

Table 1: Characters in Hindi SWARS.


All the characters have a horizontal line at the upper part, known as the SHIROREKHA or headline. No English character has such a characteristic, so it can be taken as a distinguishing feature to separate English from these scripts. In handwritten form, from left to right, the SHIROREKHA of one character joins with the SHIROREKHA of the previous or next character of the same word. In this fashion, multiple characters and modified shapes in a word appear as a single connected component joined through the common SHIROREKHA. All the characters and modified shapes in a word appear to hang from the hypothetical SHIROREKHA of the word. In DEVNAGARI there are also vowels, consonants, vowel modifiers, compound characters and numerals. Moreover, there are many similarly shaped characters. All these variations make DOCR a challenging problem [31]. As far as the Hindi SWARS characters are concerned, the features are the same as those of any DEVNAGARI script, so these features can be considered in the same form.


3. Feature extraction


Feature extraction and selection can be defined as extracting the most representative information from the raw data, which minimizes the within-class pattern variability while enhancing the between-class pattern variability. For this purpose, a set of features is extracted for each class that helps distinguish it from other classes, while remaining invariant to characteristic differences within the class.


In this paper, we perform the feature extraction from the input stimuli by using the density function of MATLAB. In our approach, the input data consist of five different sets of the handwritten Hindi SWARS characters, written by five different people. It is quite natural that five different people have different handwriting and a different writing style for every character. In this way we have a total of 65 samples of the character sets. Each character set contains different examples of the same character in different handwriting, as shown in figure 1. To prepare our training set of input-output pattern pairs, we consider each scanned handwritten character as a bitmap image. This bitmap image of a character is partitioned into 16 equal parts. After this, the row-wise and column-wise mean of each partition is obtained by using the coded MATLAB function "Mean_character_recognition.m". By doing this we obtain 16 real values for each scanned image. Hence every scanned image is now represented by an input pattern vector of 16 dimensions, which we consider as a 16x1 column matrix. An example of one such character's representation in input pattern vector form is shown in figure 2.
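The partition-and-average step described above can be sketched in a few lines. This is only an illustration in Python, not the paper's "Mean_character_recognition.m": the function name `block_means`, the 4x4 grid layout of the 16 partitions, and the toy 8x8 image are all assumptions made for the example.

```python
def block_means(image, blocks_per_side=4):
    """Split a 2-D grayscale image (a list of equal-length rows) into
    blocks_per_side x blocks_per_side equal partitions and return the
    mean pixel value of each partition, scanned row by row."""
    rows, cols = len(image), len(image[0])
    if rows % blocks_per_side or cols % blocks_per_side:
        raise ValueError("image sides must be divisible by blocks_per_side")
    bh, bw = rows // blocks_per_side, cols // blocks_per_side
    features = []
    for br in range(blocks_per_side):
        for bc in range(blocks_per_side):
            total = 0.0
            for r in range(br * bh, (br + 1) * bh):
                for c in range(bc * bw, (bc + 1) * bw):
                    total += image[r][c]
            features.append(total / (bh * bw))
    return features


# Toy 8x8 "image" (hypothetical data): every pixel equals its row index.
toy = [[float(r)] * 8 for r in range(8)]
vector = block_means(toy)
print(len(vector))   # 16 features, one per partition
print(vector[:4])    # partitions covering rows 0 and 1
```

Each scanned character would yield one such 16-element vector, so 65 samples stack into a 16x65 training matrix.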


Figure 1. Scanned images of five different samples of handwritten Hindi SWARS

(2.514091; 2.135156; 2.292104; 2.433057;
2.016629; 2.177502; 2.102489; 2.085017;
1.924126; 2.103291; 2.158773; 2.2467367;
1.893306; 2.180586; 2.324606; 2.457668)

Figure 2. The input pattern vector of order (16x1) for a sample character, obtained from the MATLAB function

In this way we can determine the input pattern vector for every scanned image. Thus we have a training set which consists of 65 input pattern vectors of size 16x1. If we consider the entire training set as a matrix of training patterns, it will be of order 16x65. To distinguish each character set from the others during training, a target output is needed. To classify these Hindi SWARS characters there must be 13 different classes. A single neuron with a binary output can differentiate between two classes, so n output neurons can encode up to 2^n classes; to differentiate 13 classes we therefore require four output neurons (2^4 = 16 >= 13). Hence the target output pattern for each input pattern is of dimension four. In the proposed method we consider the target output pattern for each character in four-bit binary form, as shown in figure 3.


Figure 3. Target output pattern for different handwritten Hindi SWARS
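The choice of four output neurons for 13 classes can be checked with a short sketch. The actual code assigned to each character in figure 3 is not reproduced here; the assumption made below (classes numbered 1 to 13 and coded in plain binary) is purely illustrative, as is the helper name `target_pattern`.

```python
import math

NUM_CLASSES = 13  # number of Hindi SWARS classes stated in the text

# Smallest number of binary output units able to label NUM_CLASSES classes.
num_outputs = math.ceil(math.log2(NUM_CLASSES))


def target_pattern(class_index, width=num_outputs):
    """Return the binary code of class_index as a list of 0/1 targets.
    Assumption: classes are numbered 1..13 and coded in plain binary,
    most significant bit first."""
    return [(class_index >> bit) & 1 for bit in reversed(range(width))]


targets = {k: target_pattern(k) for k in range(1, NUM_CLASSES + 1)}
print(num_outputs)   # number of output neurons needed
print(targets[1], targets[13])
```

Any assignment of 13 distinct four-bit codes would serve equally well; three bits would give only 8 codes, which is why four outputs are the minimum.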


Thus, we have constructed the training set of input-output pattern pairs to analyze the performance of multilayer feed forward neural networks with the two algorithms mentioned. We have also constructed a test pattern set to verify the performance of the networks. The test pattern set consists of another collection of handwritten characters, i.e. two sets written by two different people. Hence our test pattern set consists of 26 samples. The input patterns for these test character sets are constructed in the same manner as for the training set.


4. Neural network for character recognition


Neural network architectures can be classified as feed-forward and feedback (recurrent) networks. The most common neural networks used in OCR systems are the multilayer perceptrons (MLP) among the feed forward networks and Kohonen's Self Organizing Map (SOM) among the feedback networks. One of the interesting characteristics of MLPs is that in addition to classifying an input pattern, they also provide a confidence in the classification [26]. These confidence values may be used for rejecting a test pattern in case of doubt. An MLP-based approach is proposed by U. Bhattacharya et al. [32, 33]. A detailed comparison of various NN classifiers is made by M. Egmont-Petersen [34], who has shown that feed-forward networks, perceptron higher-order networks and neuro-fuzzy systems are better suited for character recognition [35]. K. Y. Rajput et al. [36] used a back propagation type NN classifier.


Genetic algorithm based feature selection and classification, along with the fusion of NN and fuzzy logic, is reported for English [37, 38], but no such work is reported for Indian languages. In this paper, to accomplish the task of classification for Hindi characters of SWARS, we use the feed forward multilayer neural network with the generalized delta learning rule and its two variants, viz. the quasi-Newton method and gradient descent with momentum and adaptive learning.


4.1 Generalized delta learning rule 


The back propagation (BP) learning algorithm is currently the most popular supervised learning rule for performing pattern classification tasks [39, 40]. It is not only used to train feed forward neural networks such as the multilayer perceptron; it has also been adapted to recurrent neural networks [41]. The BP algorithm is a generalization of the delta rule, known as the least mean square algorithm [39]. Thus, it is also called the generalized delta rule. BP overcomes the limitations of perceptron learning enumerated by Minsky and Papert [42]. Due to the BP algorithm, the MLP can be extended to many layers. The BP algorithm propagates the error between the desired signal and the network output backward through the network. After an input pattern is presented, the output of the network is compared with a given target pattern and the error of each output unit is calculated. This error signal is propagated backward, and a closed-loop control system is thus established. The weights can be adjusted by a gradient-descent based algorithm. In order to implement the BP algorithm, a continuous, non-linear, monotonically increasing, differentiable activation function is required, such as the logistic sigmoid function or the hyperbolic tangent function.


We want to train a multilayer feed forward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x, z) ∈ S. The vector x represents a pattern of input to the network and the vector z the corresponding desired output from the training set S. The objective function for optimization is defined as the sum of the instantaneous squared errors:

$$E = \frac{1}{2} \sum_{p=1}^{P} (T_p - S_p)^2 \qquad (4.1.1)$$

where $(T_p - S_p)^2$ is the squared difference between the actual output of the network on the output layer for the presented input pattern p and the target output pattern vector for the pattern p.


All the network parameters $W^{(m-1)}$ and $\theta^{(m)}$, m = 2, ..., M, can be combined and represented by the matrix $W = [w_{ij}]$. The error function E can be minimized by applying the gradient-descent procedure as:

$$\Delta W = -\eta \frac{\partial E}{\partial W} \qquad (4.1.10)$$

where $\eta$ is a learning rate or step size, provided that it is a sufficiently small positive number.


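Applying the chain rule, the derivative in equation (4.1.10) can be expressed as:

$$\frac{\partial E}{\partial w_{uv}^{(m)}} = \frac{\partial E}{\partial net_v^{(m+1)}} \, \frac{\partial net_v^{(m+1)}}{\partial w_{uv}^{(m)}} \qquad (4.1.11)$$

while

$$\frac{\partial net_v^{(m+1)}}{\partial w_{uv}^{(m)}} = \frac{\partial}{\partial w_{uv}^{(m)}} \left( \sum_{\omega} w_{\omega v}^{(m)} o_{\omega}^{(m)} + \theta_v^{(m+1)} \right) = o_u^{(m)} \qquad (4.1.12)$$

and

$$\frac{\partial E}{\partial net_v^{(m+1)}} = \frac{\partial E}{\partial o_v^{(m+1)}} \, \frac{\partial o_v^{(m+1)}}{\partial net_v^{(m+1)}} = \frac{\partial E}{\partial o_v^{(m+1)}} \, \dot{\phi}_v^{(m+1)}\!\left(net_v^{(m+1)}\right) \qquad (4.1.13)$$

Here $net_v^{(m+1)}$ is the net input to unit v of layer m+1, $o_u^{(m)}$ is the output of unit u of layer m, and $\phi$ is the activation function. For the output units, m = M-1:

$$\frac{\partial E}{\partial o_v^{(m+1)}} = e_v \qquad (4.1.14)$$

where $e_v$ is the difference between the actual and the desired output of output unit v. For the hidden units, m = 1, 2, 3, ..., M-2:

$$\frac{\partial E}{\partial o_v^{(m+1)}} = \sum_{\omega} \frac{\partial E}{\partial net_{\omega}^{(m+2)}} \, w_{v\omega}^{(m+1)} \qquad (4.1.15)$$

Define the delta function as

$$\delta_v^{(m)} = -\frac{\partial E}{\partial net_v^{(m)}} \qquad (4.1.16)$$

for m = 2, 3, ..., M. By substituting (4.1.11), (4.1.15) and (4.1.16) into (4.1.13), we finally obtain the following. For the output units, m = M-1:

$$\delta_v^{(M)} = -e_v \, \dot{\phi}_v^{(M)}\!\left(net_v^{(M)}\right) \qquad (4.1.17)$$

For the hidden units, m = 1, ..., M-2:

$$\delta_v^{(m+1)} = \dot{\phi}_v^{(m+1)}\!\left(net_v^{(m+1)}\right) \sum_{\omega} \delta_{\omega}^{(m+2)} \, w_{v\omega}^{(m+1)} \qquad (4.1.18)$$

Equations (4.1.17) and (4.1.18) provide a recursive method to solve $\delta_v^{(m+1)}$ for the whole network. Thus, W can be adjusted by

$$\frac{\partial E}{\partial w_{uv}^{(m)}} = -\delta_v^{(m+1)} \, o_u^{(m)} \qquad (4.1.19)$$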

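For the activation functions, we have the following relations, where $\beta$ is the gain of the activation function. For the logistic function:

$$\dot{\phi}(net) = \beta \, \phi(net) \left[ 1 - \phi(net) \right] \qquad (4.1.20)$$

For the tanh function:

$$\dot{\phi}(net) = \beta \left[ 1 - \phi^2(net) \right] \qquad (4.1.21)$$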
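The two derivative relations can be checked numerically. The sketch below assumes the usual definitions, logistic $\phi(net) = 1/(1 + e^{-\beta\, net})$ and $\phi(net) = \tanh(\beta\, net)$, with the gain $\beta$ as a parameter; the function names are chosen for this example only.

```python
import math


def logistic(net, beta=1.0):
    return 1.0 / (1.0 + math.exp(-beta * net))


def logistic_prime(net, beta=1.0):
    # Derivative expressed through the unit's own output: beta*phi*(1 - phi).
    y = logistic(net, beta)
    return beta * y * (1.0 - y)


def tanh_act(net, beta=1.0):
    return math.tanh(beta * net)


def tanh_prime(net, beta=1.0):
    # Derivative expressed through the unit's own output: beta*(1 - phi^2).
    y = tanh_act(net, beta)
    return beta * (1.0 - y * y)


def numeric_derivative(f, x, h=1e-6):
    # Central finite difference, used only to cross-check the formulas.
    return (f(x + h) - f(x - h)) / (2.0 * h)


for net in (-1.5, 0.0, 0.3, 2.0):
    print(net,
          logistic_prime(net) - numeric_derivative(logistic, net),
          tanh_prime(net) - numeric_derivative(tanh_act, net))
```

Because both derivatives depend only on the unit output, the backward pass can reuse the activations stored during the forward pass instead of re-evaluating the exponential.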


The update for the biases can be done in two ways. The biases in the (m+1)th layer, $\theta^{(m+1)}$, can be expressed as an expansion of the weight $W^{(m)}$, that is,

$$\theta^{(m+1)} = \left( w_{0,1}^{(m)}, \ldots, w_{0,J_{m+1}}^{(m)} \right)^T$$

where $J_m$ denotes the number of units in layer m. Accordingly, the output $o^{(m)}$ is expanded into

$$o^{(m)} = \left( 1, o_1^{(m)}, \ldots, o_{J_m}^{(m)} \right)^T$$

Another way is to use a gradient-descent method with regard to $\theta^{(m)}$, by following the above procedure. Since the biases can be treated as special weights, these are usually omitted in practical applications. The algorithm is convergent in the mean if

$$0 < \eta < \frac{2}{\lambda_{max}}$$

where $\lambda_{max}$ is the largest eigenvalue of the autocorrelation of the vector x, denoted as C [44]. When $\eta$ is too small, the possibility of getting stuck at a local minimum of the error function is increased. In contrast, the possibility of falling into oscillatory traps is high when $\eta$ is too large. By statistically preprocessing the input patterns, viz. de-correlating the input patterns, excessively large eigenvalues of C can be avoided, and thus increasing $\eta$ can effectively speed up the convergence. PCA preconditioning speeds up BP in most cases, except when the pattern set consists of sparse vectors. In practice, $\eta$ is usually chosen so that $0 < \eta < 1$, so that successive weight changes do not overshoot the minimum of the error surface. The BP algorithm can be improved by adding a momentum term [17]. It is then known as gradient descent with a momentum term. In our proposed analysis we use this learning rule in MATLAB as TRAINGDX (gradient descent with momentum and adaptive back propagation). As per this learning rule, the weight update between the output layer and the hidden layer is represented by the following weight updating equation:


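$$\Delta w_{ho}(s+1) = -\eta \frac{\partial E}{\partial w_{ho}} + \alpha \, \Delta w_{ho}(s) + \frac{1}{1 - \left( \alpha \, \Delta w_{ho}(s) \right)} \qquad (4.1.22)$$

whereas the weight update between the hidden layer and the input layer can be represented as:

$$\Delta w_{ih}(s+1) = -\eta \frac{\partial E}{\partial w_{ih}} + \alpha \, \Delta w_{ih}(s) + \frac{1}{1 - \left( \alpha \, \Delta w_{ih}(s) \right)} \qquad (4.1.23)$$

where $w_{ho}$ and $w_{ih}$ denote the hidden-to-output and input-to-hidden weights, s is the iteration index, and $\alpha$ is the momentum factor, usually $0 < \alpha < 1$. This method is usually called the BP with momentum (BPM) algorithm.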
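To make the idea of momentum combined with an adaptive learning rate concrete, the following Python sketch applies the textbook rule (new step = negative learning rate times gradient, plus momentum times previous step) to a toy quadratic error surface. It is not MATLAB's TRAINGDX source and it implements only the standard two-term momentum form; the error function, the function name `gdx_step`, and the constants `lr_inc`, `lr_dec` and `max_perf_inc` are assumptions chosen for illustration.

```python
def error(w):
    # Toy quadratic error surface (an assumption), minimum at w = (1, -2).
    return 0.5 * ((w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2)


def gradient(w):
    return [w[0] - 1.0, 10.0 * (w[1] + 2.0)]


def gdx_step(w, delta, lr, perf, momentum=0.9,
             lr_inc=1.05, lr_dec=0.7, max_perf_inc=1.04):
    """One epoch; returns (w, delta, lr, perf) after the update."""
    g = gradient(w)
    # Momentum rule: new step = -lr * gradient + momentum * previous step.
    new_delta = [-lr * gi + momentum * di for gi, di in zip(g, delta)]
    candidate = [wi + di for wi, di in zip(w, new_delta)]
    new_perf = error(candidate)
    if new_perf > perf * max_perf_inc:
        # Error grew too much: reject the step, shrink the learning rate
        # and drop the accumulated momentum.
        return w, [0.0] * len(w), lr * lr_dec, perf
    if new_perf < perf:
        lr *= lr_inc  # error decreased: allow a larger learning rate
    return candidate, new_delta, lr, new_perf


w, delta, lr = [0.0, 0.0], [0.0, 0.0], 0.01
perf = error(w)
for epoch in range(300):
    w, delta, lr, perf = gdx_step(w, delta, lr, perf)
print(w, perf, lr)
```

The learning rate grows while the error keeps falling and is cut back as soon as a step increases the error by more than the allowed ratio, which is the behavior the adaptive scheme relies on to avoid both slow progress and oscillation.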
The BP algorithm is a supervised gradient-descent technique, wherein the MSE between the actual output of the network and the desired output is minimized. It is prone to local minima in the cost function. The performance can be improved and the occurrence of local minima reduced by allowing extra hidden units, lowering the gain term, and repeating the training with different initial random weights. There is also an efficient variant of the back propagation learning algorithm, known as the quasi-Newton method, which is used to improve the performance of the feed forward multilayer network architecture for the given training set.


4.2 Quasi-Newton methods 


The quasi-Newton method is very often implemented with the BFGS update, which is the method used here for multilayer feed forward neural network training. Quasi-Newton methods approximate Newton's direction without evaluating second-order derivatives of the cost function. The approximation of the Hessian or its inverse is computed in an iterative process. They are a class of gradient-based methods whose descent direction vector d(t) approximates Newton's direction [48]:

$$d(t) = -H^{-1}(t) \, g(t) \qquad (4.2.10)$$

where g(t) is the gradient of the cost function and H(t) the Hessian at iteration t. Thus, one can obtain d(t) by solving:

$$H(t) \, d(t) = -g(t) \qquad (4.2.11)$$


The Hessian is always symmetric and is often positive-definite. Quasi-Newton methods with a positive-definite Hessian are called variable-metric methods. Secant methods are a class of variable-metric methods that use differences to obtain an approximation to the Hessian. These methods approximate the classical Newton's method, thus the convergence is very fast.


There are two globally convergent strategies available, namely the line-search and trust-region methods. The line-search method tries to limit the step size along Newton's direction until it is unacceptably large, whereas in the trust-region method the quadratic approximation of the cost function can be trusted only within a small region in the vicinity of the current point.


In quasi-Newton methods, a line search is applied such that

$$\lambda(t) = \arg\min_{\lambda} E\left( w(t) + \lambda \, d(t) \right) \qquad (4.2.12)$$

and

$$w(t+1) = w(t) + \lambda(t) \, d(t) \qquad (4.2.13)$$

Line search is used to guarantee, at each iteration, the decrease of the objective function that is dictated by the convergence requirement. The optimal $\lambda(t)$ can be theoretically derived from

$$\frac{\partial}{\partial \lambda} E\left( w(t) + \lambda \, d(t) \right) = 0 \qquad (4.2.14)$$

and this yields a representation using the Hessian. The second-order derivatives are approximated by the difference of the first-order derivatives at two neighboring points, and thus $\lambda$ is calculated by

$$\lambda(t) = \frac{-\tau \, g^T(t) \, d(t)}{d^T(t) \left[ g_{\tau}(t) - g(t) \right]} \qquad (4.2.15)$$

where $g_{\tau}(t) = \nabla_w E\left( w(t) + \tau \, d(t) \right)$, and the size of the neighborhood $\tau$ is carefully selected. Some inexact line-search and line-search-free optimization methods are applied to quasi-Newton methods, which are further used for training FNNs [45].
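The secant estimate of the step length in (4.2.15) can be illustrated on a small quadratic error function, for which the estimate coincides with the exact minimizer along the search direction. The quadratic, the steepest-descent direction and the value of `tau` below are assumptions made for this Python sketch.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def secant_step_length(grad, w, d, tau=1e-3):
    """Step length from (4.2.15): the curvature along d is estimated from
    the gradient difference between w and the nearby point w + tau * d."""
    g = grad(w)
    g_tau = grad([wi + tau * di for wi, di in zip(w, d)])
    diff = [a - b for a, b in zip(g_tau, g)]
    return -tau * dot(g, d) / dot(d, diff)


def grad(w):
    # Gradient of the toy error E(w) = 0.5 * (2*w0^2 + 10*w1^2).
    return [2.0 * w[0], 10.0 * w[1]]


w = [1.0, 1.0]
d = [-gi for gi in grad(w)]            # steepest-descent direction
lam = secant_step_length(grad, w, d)
w_next = [wi + lam * di for wi, di in zip(w, d)]
print(lam, w_next)
print(dot(grad(w_next), d))            # directional derivative after the step
```

On a non-quadratic error surface the same formula only approximates the minimizer, which is one reason inexact line searches are commonly preferred in practice.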


There are many secant methods of rank one or two. The Broyden family is a family of rank-two and rank-one methods generated by taking [46]:

$$H(t) = (1 - \varphi) \, H_{DFP}(t) + \varphi \, H_{BFGS}(t) \qquad (4.2.16)$$

where $H_{DFP}$ and $H_{BFGS}$ are, respectively, the Hessians obtained by the Davidon-Fletcher-Powell (DFP) and BFGS methods, and $\varphi$ is a constant between 0 and 1. By giving $\varphi$ different values, one can get the DFP ($\varphi = 0$), the BFGS ($\varphi = 1$), or other rank-one or rank-two formulae. The DFP and BFGS methods are two dual rank-two secant methods, and the BFGS emerges as a leading variable-metric contender in theory and practice [47].


The BFGS method [44, 46, 47] is implemented as follows. Inexact line search can be applied to the BFGS and this significantly reduces the number of evaluations of the error function. The Hessian H or its inverse is updated by:

$$H(t+1) = H(t) - \frac{H(t) \, s(t) \, s^T(t) \, H(t)}{s^T(t) \, H(t) \, s(t)} + \frac{z(t) \, z^T(t)}{s^T(t) \, z(t)} \qquad (4.2.17)$$

$$H^{-1}(t+1) = H^{-1}(t) + \left( 1 + \frac{z^T(t) \, H^{-1}(t) \, z(t)}{s^T(t) \, z(t)} \right) \frac{s(t) \, s^T(t)}{s^T(t) \, z(t)} - \frac{s(t) \, z^T(t) \, H^{-1}(t) + H^{-1}(t) \, z(t) \, s^T(t)}{s^T(t) \, z(t)} \qquad (4.2.18)$$

where

$$z(t) = g(t+1) - g(t) \qquad (4.2.19)$$

$$s(t) = w(t+1) - w(t) \qquad (4.2.20)$$

For its implementation, w(0), g(0) and $H^{-1}(0)$ need to be specified. $H^{-1}(0)$ is typically selected as the identity matrix.
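A compact way to see what the inverse update (4.2.18) does is to apply it once and check the secant condition $H^{-1}(t+1)\,z(t) = s(t)$. The Python sketch below does this with plain lists; the helper names, the toy quadratic with gradient $g = Aw$ and the two sample points are assumptions made for illustration and are not part of TRAINBFG.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def outer(a, b):
    return [[x * y for y in b] for x in a]


def bfgs_inverse_update(Hinv, s, z):
    """Inverse-Hessian update (4.2.18) with s = w(t+1) - w(t) and
    z = g(t+1) - g(t). Hinv is assumed symmetric."""
    sz = dot(s, z)
    if sz <= 0.0:
        raise ValueError("curvature condition s^T z > 0 violated")
    Hz = mat_vec(Hinv, z)                      # Hinv z
    coeff = (1.0 + dot(z, Hz) / sz) / sz
    ss, sHz, Hzs = outer(s, s), outer(s, Hz), outer(Hz, s)
    n = len(s)
    return [[Hinv[i][j] + coeff * ss[i][j] - (sHz[i][j] + Hzs[i][j]) / sz
             for j in range(n)] for i in range(n)]


# Toy quadratic E(w) = 0.5 * w^T A w with A = diag(2, 10), so g(w) = A w.
A = [[2.0, 0.0], [0.0, 10.0]]
w_old, w_new = [1.0, 1.0], [0.5, 0.8]
s = [b - a for a, b in zip(w_old, w_new)]
z = [b - a for a, b in zip(mat_vec(A, w_old), mat_vec(A, w_new))]

Hinv0 = [[1.0, 0.0], [0.0, 1.0]]               # identity as initial guess
Hinv1 = bfgs_inverse_update(Hinv0, s, z)
print(Hinv1)
print(mat_vec(Hinv1, z), s)                    # secant condition check
```

Updating the inverse directly means the search direction $d(t) = -H^{-1}(t)\,g(t)$ needs only a matrix-vector product, with no linear system to solve at each iteration.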


5. Simulation design and implementation 


In this simulation framework, we consider two
different multilayer feed-forward neural network
architectures. The first consists of 16
neurons in the input layer, 5 neurons in the
hidden layer and 4 neurons in the output layer,
while the second consists of 16
neurons in the input layer, 10 neurons in the
hidden layer and 4 neurons in the output layer.
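
To give a sense of the size of the two optimization problems, the following sketch counts the adjustable weights and biases in each architecture. It assumes fully connected layers with one bias per hidden and output neuron, which the text does not state explicitly; the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


network_1 = [16, 5, 4]    # first architecture
network_2 = [16, 10, 4]   # second architecture
params_1 = count_parameters(network_1)   # 109
params_2 = count_parameters(network_2)   # 214
```

Under this assumption, a quasi-Newton method would maintain an inverse Hessian estimate of 109 x 109 or 214 x 214 entries, whereas gradient descent only stores vectors of that length.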


The numbers of neurons in the input and output
layers are selected on the basis of the dimensionality
of the input-output pattern pairs. Both
architectures use a single hidden layer, with
a different number of neurons in each. Both networks
are trained with two improved variants of the back
propagation learning rule, namely gradient
descent with momentum and adaptive back
propagation (TRAINGDX) and BFGS quasi-Newton
back propagation (TRAINBFG). Both
learning methods are taken from the neural
network tools of the MATLAB simulation environment.
Each network architecture is trained on the same
training set of input pattern samples for handwritten
Hindi characters of SWARS, first with TRAINGDX and
then with TRAINBFG. The implementation of these two
learning methods is described in the following
subsections.


5.1 Gradient descent with momentum and 
adaptive back propagation (traingdx) 


In MATLAB this learning rule is implemented with
the function name TRAINGDX. TRAINGDX is a
network training function that updates weight
and bias values according to gradient descent
with momentum and an adaptive learning rate. The
TRAINGDX function takes the following
inputs, as shown in table 2.


Input   Description
NET     Neural network
Pd      Delayed input vectors
Tl      Layer target vectors
Ai      Initial input delay conditions
Q       Batch size
TS      Time steps
VV      Empty matrix [] or structure of validation vectors
TV      Empty matrix [] or structure of test vectors


Table 2. Input information for function
TRAINGDX


This function returns the following output values,
as shown in table 3.


Output     Description
NET        Trained network
TR         Training record of various values over each epoch:
TR.epoch   Epoch number
TR.perf    Training performance
TR.vperf   Validation performance
TR.tperf   Test performance
TR.lr      Adaptive learning rate
Ac         Collective layer outputs for last epoch
El         Layer errors for last epoch


Table 3. Output values from the function
TRAINGDX

The parameters used to train the neural network
architectures with the TRAINGDX learning rule,
together with their values, are shown in table 4.


Parameter      Value   Description
Epochs         1000    Maximum number of epochs to train
Goal           0       Performance goal
Lr             0.01    Learning rate
lr_inc         1.05    Ratio to increase learning rate
lr_dec         0.7     Ratio to decrease learning rate
max_fail       5       Maximum validation failures
max_perf_inc   1.04    Maximum performance increase


Table 4. Parameters used in TRAINGDX
during training

TRAINGDX can train any network as long as its
weight, net input and transfer functions have
derivative functions. In this learning rule, back
propagation is used to calculate the derivatives of
performance with respect to the weight and bias
values, and each variable is adjusted according to
gradient descent with momentum, as described
in equations (4.1.22) and (4.1.23). The
learning rate is adaptive. In each
epoch, if performance decreases toward the
goal, the learning rate is increased by the factor
lr_inc. If performance increases by more than the
factor max_perf_inc, the learning rate is adjusted
by the factor lr_dec and the change that
increased the performance is not made.
Training with this learning function stops
when any one of the following conditions occurs:

1) The maximum number of EPOCHS
(repetitions) is reached.

2) The maximum amount of TIME has been 
exceeded. 

3) Performance has been minimized to the 
GOAL. 

4) The performance gradient falls below 
MINGRAD. 

5) Validation performance has increased more 
than MAX_FAIL times since the last time it 
decreased (when using validation). 
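
The adaptive rule described above can be sketched as follows. This is a minimal illustration, not MATLAB's TRAINGDX implementation: the momentum update `dw = mc * dw_prev - lr * (1 - mc) * gradient`, the resetting of the momentum term after a rejected step, and the toy quadratic error are all assumptions made here. The default values mirror the TRAINGDX parameters reported in this section; validation and time limits are omitted.

```python
def train_adaptive_momentum(perf, grad, w, lr=0.01, mc=0.9, lr_inc=1.05,
                            lr_dec=0.7, max_perf_inc=1.04, epochs=1000,
                            goal=0.0, min_grad=1e-10):
    """Gradient descent with momentum and an adaptive learning rate (sketch)."""
    w = list(w)
    dw = [0.0] * len(w)
    p = perf(w)
    for _ in range(epochs):
        g = grad(w)
        grad_norm = sum(gi * gi for gi in g) ** 0.5
        if p <= goal or grad_norm < min_grad:
            break
        # Assumed momentum form of the weight change.
        step = [mc * dwi - lr * (1.0 - mc) * gi for dwi, gi in zip(dw, g)]
        w_new = [wi + si for wi, si in zip(w, step)]
        p_new = perf(w_new)
        if p_new > p * max_perf_inc:
            # Error grew too much: shrink the rate and discard the change.
            lr *= lr_dec
            dw = [0.0] * len(w)
            continue
        if p_new < p:
            # Error moved toward the goal: grow the rate.
            lr *= lr_inc
        w, dw, p = w_new, step, p_new
    return w, p, lr


# Hypothetical error surface E(w) = w0^2 + w1^2 with a single minimum.
def sq_error(v):
    return sum(x * x for x in v)


def sq_grad(v):
    return [2.0 * x for x in v]


w_final, p_final, lr_final = train_adaptive_momentum(sq_error, sq_grad, [3.0, -2.0])
```

On this convex toy problem the rule converges; the paper's point is that on the handwritten-character training set the same kind of rule can stall in local minima.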

Table 4 (continued). Remaining TRAINGDX training parameters

Parameter   Value         Description
Mc          0.9           Momentum constant
min_grad    1e-10         Minimum performance gradient
Show        [illegible]   Epochs between displays (NaN for no displays)
Time        Inf           Maximum time to train in seconds


5.2 BFGS quasi-Newton back propagation (TRAINBFG)

In MATLAB this learning rule is implemented with
the function name TRAINBFG. TRAINBFG is a
network training function that updates weight
and bias values according to the BFGS
quasi-Newton method. The TRAINBFG function
takes the following inputs, as shown in
table 5.


Input   Description
NET     Neural network
Pd      Delayed input vectors
Tl      Layer target vectors
Ai      Initial input delay conditions
Q       Batch size
TS      Time steps
VV      Either empty matrix [] or structure of validation vectors
TV      Either empty matrix [] or structure of test vectors


Table 5. Input information for function TRAINBFG


Output     Description
NET        Trained network
TR         Training record of various values over each epoch:
TR.epoch   Epoch number
TR.perf    Training performance
TR.vperf   Validation performance
TR.tperf   Test performance
Ac         Collective layer outputs for last epoch
El         Layer errors for last epoch


Table 6. Output values from the function
TRAINBFG

The parameters used to train the neural network
architectures with the TRAINBFG learning rule,
together with their values, are shown in table 7.


Parameter   Value       Description
Epochs      1000        Maximum number of epochs to train
Show        25          Epochs between displays (NaN for no displays)
Goal        0           Performance goal
Time        Inf         Maximum time to train in seconds
min_grad    1e-6        Minimum performance gradient
max_fail    5           Maximum validation failures
searchFcn   'srchcha'   Name of line search routine to use


Table 7. Parameters used in TRAINBFG during
training


TRAINBFG can train any network as long as its
weight, net input and transfer functions have
derivative functions. In this learning rule, back
propagation is used to calculate the derivatives of
performance with respect to the weight and bias
values, and each variable is adjusted according to the
quasi-Newton update described in
equation (4.2.18). Training with this learning
function stops when any one of the
following conditions occurs:


1) The maximum number of EPOCHS
(repetitions) is reached.


2) The maximum amount of TIME has been 
exceeded. 


3) Performance has been minimized to the 
GOAL. 


4) The performance gradient falls below
MINGRAD.


5) Validation performance has increased more 
than MAX_FAIL times since the last time it 
decreased (when using validation). 
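
The five stopping conditions listed above can be expressed as a small check performed once per epoch. The sketch below is an illustration only; the function name, argument names and returned strings are hypothetical, and the default thresholds mirror the TRAINBFG values in table 7.

```python
import math


def stop_reason(epoch, elapsed, perf, grad_norm, val_failures,
                epochs=1000, time_limit=math.inf, goal=0.0,
                min_grad=1e-6, max_fail=5):
    """Return the first stopping condition that holds, or None to continue."""
    if epoch >= epochs:
        return "maximum epochs reached"
    if elapsed > time_limit:
        return "maximum time exceeded"
    if perf <= goal:
        return "performance goal met"
    if grad_norm < min_grad:
        return "minimum gradient reached"
    if val_failures > max_fail:
        return "validation failures exceeded"
    return None


# Hypothetical training states.
at_epoch_limit = stop_reason(1000, 12.0, 0.5, 1.0, 0)
flat_gradient = stop_reason(10, 0.2, 0.5, 1e-8, 0)
still_training = stop_reason(10, 0.2, 0.5, 1.0, 0)
```

With a performance goal of 0, a run that does not converge ends on the epoch limit or on a vanishing gradient rather than on the goal.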


6. Results and discussions 


The results for the recognition of the handwritten
Hindi characters of SWARS using TRAINGDX and
TRAINBFG are shown in tables 8 and 9; both
networks exhibit non-convergence.
Table 8 shows the results for network one, i.e.
16-5-4. It can be seen from the results that the
network does not converge for any character. In
this non-convergence case, the behavior of the error
for the BFGS quasi-Newton back propagation
learning algorithm is more consistent. For both
training algorithms, the error is at its minimum
for the character ए and near its maximum for the
character ऋ. It has also been observed that the
error for औ is increasing for both learning
algorithms. The results also indicate that the BFGS
quasi-Newton back propagation learning algorithm
(TRAINBFG) involves less error than the gradient
descent with momentum and adaptive back
propagation learning algorithm (TRAINGDX). In
our experiments both algorithms were
trained for an equal number of epochs, and
TRAINBFG seems to be more effective than
TRAINGDX.

The results for the second network, i.e. 16-10-4,
indicate the same performance for both
algorithms. The important point that can
be observed in the second network is that the
tendency toward minimum error is common to both
algorithms: both minimize the error for the same
character and increase the error for the same
character. Another point is that with the BFGS
quasi-Newton back propagation learning
algorithm no error is negative, whereas
with the gradient descent with momentum and
adaptive back propagation learning algorithm
the error is negative for some characters. This
tendency is found in both networks. A further
observation is that with TRAINBFG the error
is more than 1 for some characters, which does
not happen with TRAINGDX. The negative error in
TRAINGDX for some characters is due to poor
approximation of the sample points. The
domination of local errors for some
characters also plays the dominant role in
the non-convergence for the given training set. The
performance of both algorithms can also be seen
in the graphical representation in
figures 4 and 5.


Sample (row order)   TRAINGDX    TRAINBFG
1                    -0.23535    0.555367
2                    -0.23545    0.529788
3                     0.014364   0.782192
4                    -0.22206    0.500621
5                     0.014238   0.81352
6                     0.01423    0.822936
7                     0.256107   1.02635
8                     0.21767    0.441452
9                     0.030275   0.804958
10                    0.01521    0.868561
11                    0.267964   1.131408

Table 8. Results for the training of handwritten
Hindi characters of SWARS for network 16-5-4
with both training functions


Sample (row order)   TRAINGDX      TRAINBFG
1                    -0.19502      0.088245
2                    [illegible]   [illegible]
3                     0.065327     0.355528
4                    [illegible]   [illegible]
5                     0.06164      0.355768
6                    -0.00367      0.316997
7                     0.232999     0.528438
8                    -0.1439       0.085387
9                    -0.00926      0.323848
10                   -0.02932      0.323767
11                    0.222451     0.563787
12                    0.010792     0.328683
13                    0.19633      0.585789

Table 9. Results for the training of handwritten
Hindi characters of SWARS for network 16-10-4
with both training functions


Figure 4. The performance analysis for the training set from
network 16-5-4 with both training functions


Figure 5. The performance analysis for the
training set from network 16-10-4 with both
training functions
7. Conclusion


The results of the experiments clearly show that
both networks exhibit non-convergence
for the given training set of handwritten Hindi
characters of SWARS with both training
functions. It means that, for any number of
iterations, the networks will not converge with
these two enhanced variants of the back
propagation learning algorithm for the given
training set of Hindi characters of SWARS. It is
expected that if the number of neurons in the hidden
layer is made much larger, the error may
be further reduced with the BFGS quasi-Newton back
propagation learning algorithm, but the
non-convergence will still exist. The performance
of these two learning algorithms on a complex
training set such as the handwritten character set
of SWARS reflects the limitation of the gradient
descent learning algorithm for convergence, due to the
problem of local minima, which is an inherent
problem of the back propagation learning algorithm.
It shows that the local minima problem is an
inherent feature of all variants of gradient
descent learning methods. It means that gradient descent
methods are not likely to be perfect search
algorithms for global optimization. This opens the
possibility of incorporating evolutionary
search for similar types of complex pattern
classification problems.


Of the two methods proposed to minimize the charge sharing
problems in dynamic CMOS logic circuits at low voltage, it is
clear that dynamic CMOS with a bleeder minimizes the charge sharing
problems quite well when compared with internal node precharge. With
internal node precharge, better results can be obtained if the W/L of the
internal node precharge PMOS is increased. Various results obtained using
T-Spice have also been shown and discussed.